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ABSTRACT 

We consider the task of distinguishing between two different alternative models that 
can roughly equally explain observed time series data, mainly focusing on the period 
ambiguity case (aliasing). We propose a test for checking whether the rival models 
are observationally equivalent for now or they are already distinguishable. It is the 
Vuong closeness test, which is based on the Kullback-Leibler Information Criterion. 
It is asymptotically normal and can work (in certain sense) even in the misspecified 
case, when the both proposed alternatives are actually wrong. This test is also very 
simple for practical use. We apply it to several known extrasolar planetary systems 
and find that our method often helps to resolve various model ambiguities emerging 
in astronomical practice, but preventing us from hasty conclusions in other cases. 
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1 INTRODUCTION 

The search for periodicities is one of the most basic tasks of 
the observational data analysis. This task emerges in all or 
almost all branches of astronomy (and even not only astron- 
omy), which deal with observational (or experimental) data. 

■ This task is usually solved by means of the periodogram- 
. !^ ', based approach. In this approach, one considers certain func- 

■ tion of a period (or frequency), which basically represents 
5^ ■ some estimation of the power spectrum of the observed pro- 

, cess. Such estimating function is traditionally called a pe- 
riodogram, and there are many types of periodograms that 
are used in practice. The lLombI |l976l ') - IScarglel (llQSd ) pe- 
riodogram is a popular choice, for instance. 

The observed data are usually noisy and, consequently, 
such data produce noisy periodograms. This implies that 
some statistical issues should be usually resolved during the 
period search procedure. The first such issue is the determi- 
nation of the statistical significance of the extracted period- 
icities. This problem is a lready investigated i n adv ance by 
various authors, see e.g. (jFrescura et al.l [20081 ) and (jBaluevI 
[2008. ) for a recent review. 

The present work is devoted to another issue, which 
arises when the original data are not spaced uniformly in 
time. In astronomy, it is a frequent case when the obser- 
vations a gapped in time due to several natural phenom- 
ena like the day/night cycle, moonlight contamination, etc. 
It is well-known that such data produce so-called 
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that is false stroboscopic periods contaminating the peri- 
odograms. Sometimes, it is even difficult to say, which pe- 
riodogram peak is the real one and which is its alias. Our 
paper will focus on this situation of the alias ambiguity. The 
periodograms themselves are usually easily calculated using 
pretty simple formulae. Recent results (jBaluevI l2008l ) also 
allow very simple (but rigourously justified) assessing of the 
significance of the extracted periods. However, there is a 
lack of similarly simple and rigorous methods solving the 
alias discrimination task. 

In a particular application, we usually have, in addition 
to the raw time series data, some prior information concern- 
ing the phenomenon under research. This information often 
can be used to select one of the available alternatives, or 
at least to retract some of them, making the final pool of 
solutions more narrow. For example, in the exoplanet ra- 
dial velocity searches, a popular approach involves various 
dynamical stability or regularity criteria dGozdziewski et al.l 
l2006l : iGozdziewski et al.ll2007l : iGozdziewski et al.ll2008l ). No 
doubt, such methods are useful, but we would often prefer 
not to bind ourselves to any external assumptions as much 
as possible, allowing the data to speak for themselves. For 
example, in the case of exoplanet searches, the apparent ra- 
dial velocity patterns can often be caused by stellar activity 
effects rather than by orbiting planets. In such a case any 
dynamical criteria may be unreliable, since we cannot be 
sure even that a given radial velocity oscillation is indeed 
induced by a really existing planet. 

In other words, in this work we put a goal to find a 
purely statistical method of alias discrimination, that could 
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be maximally independent on v arious prior assumptions. W e 
may highlight a recent work bv lOawson fc Fabrvckvll|20ld ). 
where the authors propose to take into account the phases 
of candidate aliases, when choosing between them. However, 
this approach is not very rigorous, because it basically does 
not take into account any possible noisy fluctuations of these 
phases, as well as of the aliases amplitudes. In addition, this 
approach is not easy to apply in an automated way, since it 
requires an active participation of the researcher. 

The source of the lesser attention to this problem prob- 
ably comes from the fact that to distinguish aliases we need 
to compare two non-nested models of the data, while most 
of the traditionally used statistical criteria are usually de- 
signed to choose between nested models. For example, the 
signal detection task requires to test some "base" model 
against a more general "base-|-signal" model, which includes 
the original base model as a partial case. The alias ambigu- 
ity does not infer nested models: none of the two alternative 
periodicities can encompass another one as a partial case. 
Such situation usually falls out of a typical handbook on 
mathematical statistics. Nevertheless, we find that there are 
many individual statistical works dealing with tests for non- 
nested hypotheses, so this statistical problem appears well- 
studied. Our goal in this paper is to apply these results to 
the astronomical task of resolving the alias ambiguity, and 
to construct the appropriate period distinguishing criteria. 

The problem is also complicated because we often do 
not have an accurate model of the real periodic variation. 
In practice, both alternative models (e.g. simple sinusoids) 
may be only approximate, and our analysis cannot base on 
an assumption that one of them is strictly true. Therefore, 
the method of the analysis should remain valid in the so- 
called misspecified case, when none of the available models 
can mimic the true signal precisely, but we are still interested 
to find out which alternative is better. 

The plan of the paper is as follows. In Section [2j we 
provide a more clear formulation of the problem, what we 
actually want to derive from our analysis. In Section [3l we 
describe the basic formulae and ideas of the Vuong statisti- 
cal test, suitable for solving our task. In Section |4l we pro- 
vide the final practical formulae of this test. In Section [5] 
we assess its reliability and efficiency by means of numerical 
Monte Carlo simulations. Finally, in Section[51 we apply our 
methods to the analysis of the radial velocity data of sev- 
eral extrasolar planetary system, which demonstrate various 
alias ambiguities concerning the planetary periods or other 
model parameters. 



2 FORMULATION OF THE PROBLEM 

From the first view point, the task can be easily formulated. 
Assume we have two main peaks in the periodograms, hav- 
ing similar height. We have already established that the pe- 
riodic signal is significant indeed (at least the larger peak or 
both peaks are significant), and our first naive question is: 
which peak we ought to recognise as the true one and which 
should be accused as its alias? A naive question has a naive 
answer: we should adopt the taller peak as a more likely can- 
didate. Then the second question arises: how we can ensure 
that such choice is statistically justified? In other words, is 
the observed difference between the two peaks statistically 



significant? If it is not then any attempt to choose between 
the two models is, in fact, no better than dropping a coin. In 
such situation, we can occasionaly choose the true period, 
but such luck cannot be classifled as a success of the anal- 
ysis, since we cannot ensure that we were so lucky indeed. 
Therefore, we should make available a third, inconclusive, 
outcome of the analysis. This makes the problem qualita- 
tively more complicated, because more types of decisions 
infer more freedom for making mistakes. 

In the framework of the symmetric hypotheses testing 
with no inconclusive decisions, we have only two possible 
outcomes, and basically there is only one type of mistakes: 
accepting a wrong model. In practice, the two models are 
often treated unequally, however. For example, the first one 
may be considered as more reliable, while the second one 
is admitted only hypothetically. In that case we have two 
formal types of mistakes, but the mistakes to wrongly reject 
the base model are naturally considered as more dangerous. 
Thus, when applying the traditional statistical tests we first 
limit the probability of this more dangerous "false alarm", 
and only then we try to minimize the mistakes of the sec- 
ond kind, if it is possible. The two types of mistakes are 
antagonistic: suppressing the frequency of false alarms leads 
to increased fraction of wrong non-rejections of the null hy- 
pothesis. 

With three decisions at disposal, the following types of 
mistakes threaten: (I) choosing wrong model when another 
one is actually better, (II) not choosing any model when one 
is actually better, and (III) choosing a model when objec- 
tively none is really better. The class (I) formally contains 
two subclasses: to wrongly accept the first or the second 
model. We consider a symmetric case here, when none of 
the alternative models is a priori preferable, so these two 
subclasses are qualitatively equivalent to each other and 
can be united into one. The three remaining types of mis- 
takes are already qualitatively different. We cannot simul- 
taneously suppress mistakes of all three types, and we can- 
not say which one is most dangerous in general. It depends 
on the current practical circumstances: sometimes we can- 
not tolerate a misclassification, or sometimes an inconclusive 
answer is highly undesirable. The methods of our paper are 
designed to suppress the "pink- vision" mistake (III), i.e. the 
cases when we wrongly assume that the two rival models are 
distinguishable on the basis of the available observations. 
The main arguments supporting this choice are: 

(i) Such approach allows for an intuitive formalisation of 
the "statistically justified" choice. We choose any of the 
alternatives only after we have ensured that they are not 
observationally equal. Otherwise, we acknowledge that any 
choice would be too risky for now, and we should live with 
such ambiguity until more observations resolve it. 

(ii) Frequently, none of the proposed models may be cor- 
rect, and a more accurate model should be constructed to 
replace both. This task usually cannot be fulfilled reliably 
until we are ensured that the two original models can be dis- 
tinguished. It is usually more justified to seek for any theory 
update only after we managed all its current ambiguities. 

(iii) Consider the mutual interaction between the three 
possible types of the mistakes. For example, suppressing 
(III), i.e. generating more inconclusive answers, will likely 
suppress (I) and favours to (II), while suppressing (II) ob- 
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viously favours to both (I) and (III). In other words, (II) 
is antagonistic to (I) and (III), while (I) and (III) are not 
mutually antagonistic. Thus it should be technically conve- 
nient to first suppress (I) or (III), which should be broadly 
equivalent. 

(iv) Putting the third-type mistakes in the first place al- 
lows us to easily reformulate the problem in the classical 
hypothesis testing framework, operating in the traditional 
terms (to a certain point). Indeed, our main task is now to 
test the null hypothesis "the observed divergence between 
the two models is consistent with random noise" against 
the alternative "this divergence is beyond the random noise 
level". 

We still need to formalise the notion of "observational 
equivalence" of the rival models. The problem is complicated 
by the absence of any guarantee that at least one of the 
alternative models is indeed correct. In true, the signal may 
have a non-sinusoidal shape (while, e.g., the Lomb-Scargle 
periodogram implicitly assumes that it is always sinusoidal), 
and the data may contain some extra small periodic or non- 
periodic variations. We usually want to know, which model 
fits the observations better, rather than which one is true. 
Therefore, our test should not rely on an assumption that 
one of the proposed models is correct^ 

We find that the Vuong test (|Vuonel[l989l ) suits our 
needs very well. This method utilises the KuUback-Leibler 
Information Criterion (KLIC) to define which of the rival 
models is "better". 



3 FUNDAMENTALS OF THE TEST 

Let us consider two rival models of the data, fii{t,6i) and 
/U2(t, 62) parametrised by the vectorial arguments Gi and 
02- In the period ambiguity case these models (as well as 
the parameter vectors) are functionally the same for both 
alternative models, the only difference is due to the differ- 
ent values of the frequency (which is embedded inside ©1,2). 
Nonetheless we still use a bit more general notation then is 
necessary for our aliasing case. In general, 61 and G2 may 
be even unrelated to each other, having different dimen- 
sions. Given the timings ti, measurements Xi, and expected 
(known) measurement uncertainties ai (for i — 1,2, . . . N), 
we may assume (just for some definiteness) that the proba- 
bility density function of each measurement is Gaussian: 



fi{xi\ti,ai,Oi) = 



exp 



Xi - fll{ti, 0l) 



(1) 



with a similar definition /2 for the second rival model, ^2- 
For further shortness, we define the pairs Zi — {ti,ai} and 
two vectors: x containing all Xi and z containing all Zi (that 
is, ti and at). These model probability density functions 
/i,2 depend on the unknown vectors of parameters 61,2, 
which can be estimated from the data using the maximum- 
likelihood approach, which turns into the least-square one 
in the Gaussian case. Namely, the necessary estimations of 
01,2 can be defined as 



where the functions 

N 

Ck{x\z,6k) = fk{xi\zi,dk) 



xl{x\z,6k) 



i = l 
N 

E 



Xi - fJ,k{ti, 6k) 



(3) 



represent the appropriate likelihood functions and the chi- 
square functions, respectively. Note that the likelihood func- 
tion also represents the joint probability density for the vec- 
tor of the observables x, given fixed timings, error uncer- 
tainties, and model parameters. 

Let us now try to rigorously compare the two alternative 
models. IVuond l| 19891 ) suggests to use the Kullback-Leibler 
Information Criterion (KLIC) to compare the models in the 
sense of their separation from the truth: 



KLICfc(0fc) 



KLICi2(6>i,6»2) = 



log 



h{x\z) 



}k{x\z, Ok 



-h(x, z) dxdz - 



, Hx\z) 

fk{x\z,eu) 

KLIC2(6>2) -KLICi(6»i) = 

fj^f^M^,.)dxdz 

f2(x\z,e2) 

E".,.log-^^("l^'^^) 



log 



/2(x|2,02)' 



(4) 



0fc = arg max £k(x\z,6k) = arg min Xk{sc\z,Gk) 



(2) 



where h{x, z) represents the true (unknown) joint probabil- 
ity density of a single measurement x and of the associated 
quantities in z, and h{x\z) is the respective conditional den- 
sity of X given z. This true distribution h involves, the true 
signal shape and true error distribution shape, instead of 
the modelled ones. The quantities KLICi and KLIC2 assess 
the separatiorQ between one of the two available statistical 
models of the data and the truth. The difference between 
these quantities, KLIC12, measures the ability of the first 
model to reproduce the true distribution, relatively to the 
second one. The definitions in Q do not yet involve any- 
thing from the actual observational data, they are defined 
regardless of what we observe. The symbol denotes the 
mathematical expectation, taken for the true distribution h. 

It is not required that the model error distribution 
shape set in the functions fk should be close to h. In prac- 
tice, we usually have no other option than to set fk to the 
Gaussian functional shape ((TJ. Then KLIC12 will measure 
the separation between /xi and fi2 in the sense of their ex- 
pected mean-square residuals. Such separation measure re- 
mains quite sensible and justified even when h(x\z) is non- 
Gaussian. Moreover, even if we know definitely the non- 
Gaussian shape of h, we may want to compare the mod- 
els in a homogeneous manner, still using the mean-square 
residual for this goal, i.e. assuming Gaussian fk- We may 
want to avoid any binding to the true shape of h as much 
as possible, because it is often related only to instrumental 
properties rather than to physical objects that we observe. 

In general, there is no "true" values of 0i and O2 that 
we could substitute in Q, since both rival models /i,2 are 



^ We say "separation" instead of "distance" , since this divergence 
measure does not satisfy all necessary axioms of the distance, e.g. 
it is not symmetric and does not satisfy the triangle inequality. 



4 R.V. Baluev 



wrong and the true distribution h is not parametrized at all. 
We should substitute the so-called "pseudo-true" values of 
01 and 62 , which represent some theoretically most suitable 
values of these parameters. They are defined as the points 
where KLICi or KLIC2 reaches their maximums. Since the 
true density h{x, z) is unknown, the pseudo-true values of 
01,2 are unknown too, but the estimations ([2]) can approxi- 
mate them. Eventually, we want to test the null hypothesis 
KLIC12 = (the two models are equally close to the truth) 
against the alternative KLIC12 7^ 0. 

It is essential that in the definition (|4]) the timings and 
error uncertainties inside Zi are treated as random quantities 
too, so we can speak of the joint density h{x, z). Such prob- 
abilistic interpretation of apparently non-random quantities 
is in fact quite natural, and we consider it as a strength of the 
approach. In practice, we often do not know the exact time 
of each observation in advance. This time depends on many 
things that are unrelated to the astronomical problem, like 
the observatory's time allocation policy and current routine 
circumstances, the racing between different observing pro- 
grammes, etc. This issue is also discussed in more details 
in Appendix [X] It is important that we are not required to 
specify/estimate h{x,z) explicitly, the test will deal with it 
automatically. Also, we do not make any restrictive assump- 
tions about the distribution of the timings ti: it is allowed 
to be highly non-uniform, e.g. gapped (as it usually occurs 
in astronomy). 

Now, it is not difficult to realize that the following nor- 
malized log-likelihood ratio 



1 , Ci{x\z,0i) 

L = -TT log — 

N £.2{x\z,e2) 



1 



fi{x,\z„ei) 

f2{Xi\z^,02) 



(5) 



represents an empirical estimation of the KLIC divergence 
measure (U), since the averaging over h approximates the 
mathematical expectation in Q. Note that the individual 
terms k axe generated by the real data, so they are automat- 
ically averaged on the basis of the unknown true distribution 
h. This trick is rather reminiscent of the well-known jack- 
knife or bootstrap procedures, which also use the original 
sample to eliminate the necessity to specify the unknown 
error distribution. 

Furthermore, we can find the empirical variance of k as 



1 2 if'' 

i—l \z— 1 



(6) 



The uncertainty of L is then equal to v/y/N, and therefore 
the final Vuong statistic represents another re-normalised 
log-likelihood ratio 



V 



V'^ I 



Y^iv ,2 L / Y^iv , 



(7) 



Under the null hypothesis, V behaves asymptotically (for 
A'^ — >■ 00) as a standard normal varaible (mean zero, variance 
unit). This result can be used to test the models equivalence. 
If |V| is too large then our null hypothesis KLIC12 = is 
inconsistent with the data, so that the rival models are well- 
distinguishable and we can safely adopt the one which offers 
a better likelihood (i.e., the first one if V > 0, or the second 
one otherwise). If V is not large enough then we have to ac- 
knowledge that for now it is still too risky to choose between 



the alternatives and it is better to seek for more data before 
drawing any definite conclusion. Given an observed value of 
V, we can calculate the associated false alarm probability as 
FAP = 2$(|V|), where <&(x) is the standard Gaussian tail 
function (i.e., probability for a standard normal varaible to 
exceed a given x). To reject the null hypothesis, we need to 
have this FAP below some small critical value FAP* (typi- 
cally, 1%, 5%, etc.). 

The formula (O actually refers to the case when both 
models have the same number of free parameters (degrees of 
freedom). If it is not the case, we need to add a minor bias 
correction, because models with larger number of parame- 
ters always produce systematically better fits tha n models 
with smaller number of parameters. IVuond (|l989l ) suggests 
to add a Bayesian-style correction as 



■In AT 



(8) 



I? 



I 



with d\ and d2 being the numbers of free parameters in 
the models. This correction tends to zero when N <x, 
although may remain relatively important in practice, espe- 
cially when di — d2 is large0 In this paper we only deal with 
models having the same number of parameters, so we will 
use only the uncorrected formula 

Since the Vuong test is based on the usual likelihood 
ratio statistic, it is interesing to compare it with the tradi- 
tional methods for testing nested hypotheses, when the first 
model represents a parametric subspace of the second one. 
This traditional problem basically represents a degenerate 
case in the conditions of the non-nested hypotheses testing, 
and the distribution of V is then not asymptotically normal. 
It is well-known that in that case the quantity LN follows a 
chi-square distribution (if the first model is correct). As it is 
discussed in the previous paragraphs, in the non-nested case 
the quantity L\/N is asymptotically normal. Therefore, the 
main difference between the nested and non-nested situa- 
tions is in the magnitude of the random scatter in L, about 
~ 1/A or about ~ l/^fN. 

We need to emphasize that the Voung statistic is not 
equivalent to the textbook likelihood ratio in ((5]). The only 
common thing between L and V is their sign. This means 
that the Vuong test uses in fact the likelihood function to 
identify which of the models is more likely. However, as we 
explained in Sect. [2] our main goal was to find out whether 
the alternatives are distinguishable or they are not. The 
likelihood ratio method is not usable for this, because its 
distribution in this situation is not known in general. It is 
only known that the distribution of L is asymptotically nor- 
mal, but its variance is severely dependent on the particular 
data/error mo del in /, an d should be evaluated on the task- 
by-task basis (|Coxlll962l '). The Vuong test involves a nor- 
malization that makes the distribution of V asymptotically 
invariable with respect to the model /. 



^ When passing to the limit TV — > 00, the values of L in JSjl and 
V in ^ tend to certain constant finite values. Namely, L tends 
to the KLIC defined in ((JJ, and v tends to a similar expression 
with mathematical expectation replaced by the variance operator. 
Bearing this in mind, we easily find that the difference between 10 
and JSjl decreases as In N/\/~N. 
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Another very important consequence is that the Vuong 
test is asymptotically insensitive to possible model faults, in 
the sense that V preserves its standard normal distribution 
even if the true signal and error behaviour does not really 
match the model /. Such possible shortcomings of / may 
have two distinct reasons: (i) wrong assumptions about the 
signal shape and (ii) wrong assumptions about the measure- 
ment error distribution. 

In the first case, we try to fit the observed signal using 
inadequate models. In this case our test cannot suggest any 
third (correct) model, this task still lies on the researcher's 
shoulders. However, it can advice us whether the models 
at disposal are distinguishable or equivalent. The answer to 
this question is often important even for inaccurate models, 
because if these simpler models are indistinguishable then 
there is a little chance that some other more complicated 
model may appear more suitable (and at least observation- 
ally distinguishable from the original models). In this case, it 
is important that both Vuong test does not assume that any 
of the available alternatives is true (while, e.g., in the clas- 
sic base/alternative hypothesis testing it is always assumed 
that either first or at least the second model is functionally 
correct) . 

In the second case, we may assume inadequate model 
for the error distribution. This is not an obstacle for our 
test at all: the asymptotic distribution of the statistic V re- 
mains the same for the most well-behaving distributions of 
the measurement errors (possibly except for some too heavy- 
tailed ones th at always con stitute a trouble, see all rigorous 
formulation in IVuonidll989l V As we noted above, in practice 
it is even desirable not to bind our ordering criterion (KLIC) 
to the shape of the error distribution, since this shape is usu- 
ally related to the instrumental characteristics rather than 
to the physical phenomenon under study. For example, set- 
ting / to a Gaussian bell shape we always order our sig- 
nal models in the sense of the mean-square deviation. When 
/ does not match the true shape of the error distribution, 
the quantity in Eq. ((SJ and, consequently, V are related to 
the so-called pseudo maximum-likelihood tests that possess 
many practically useful properti es of the classic likelihood 
ones (|Gourieroux et al.' 1984; B aluevll2009l '). 

Nonetheless, the Vuona, (,198^ ) paper contains a possi- 
bly essential requirement of the statistical independence of 
the pairs {zi,Xi) for different i. Therefore, both the Vuong 
test may yield unreliable results, if the measurement noise 
is correlated (not white). 



4 PRACTICAL FORMULAE 

For the Gaussian distribution ([l]), the Vuong statistic can 
be derived from substituting 



IJ.{t,0) = a cos(cjt) + bsm{ijjt), — {a, 6, oj}. 



(10) 



k = 



(pi{ti) - 'p.2{ti)) 



lii{t) = iii{t,9i), M2(i) = M2(i, 62). 



(9) 



These formulae can be used after substitution of the best 
fitting alternative models /2i and p.2 that we wish to compare 
(these models are task-specific). 

In the simplified aliasing case, we have formally the 
same model for the both alternatives, which represents a 
sinusoidal oscillation: 



The two alternatives are different only due to the different 
admissible ranges for the frequency: for the first model, fii, 
the (circular) frequency uj should be located around the first 
rival frequency, uji , and for the model ^2 it should be around 
UJ2- In practice, it is more convenient to treat fii and ^2 as 
different models, rather than two variants of the model (|10|) . 
because of the strong non-linearity of the frequency param- 
eter. The test that we have described has asymptotic nature 
(for — > 00), consequently it implicitly utilise some hidden 
linearisation of the models in the vicinity of the best fitting 
estimations. Although the coefficients a and b are fully lin- 
ear for all values of to, the frequency u itself is not globally 
linear and is not linearisable in the global sense, although it 
is still linearasable in the local sense. Therefore, we should 
treat the basic model jj, in the vicinities u ~ Qi and u ~ UJ2 
as two separate models. 

Let us write down the classical expression for the Lomb- 
Scargle periodogram: 



J2 fi cosuj{ti - t) 



+ 



^ sinuj [ti — t] 
J2 sin^ oj{ti - t) 



tan 2ujr — 



sin 2u)ti 
cos 2ujti 



(11) 



Note that r is also a function of cj. As it is already well- 
known, the Lomb-Scargle periodogram is directly related to 
the likelihood ratio st atistic, or, in the Gaussian cas e , to the 
chi-sq uare statistic jZechmeister fc Klirstej l2009l : iBaluevI 
l2008i). In addition to the periodogram itself, we will need 
the expressions for the associated best fitting coefficients a 
and b of the original model (|10|) . They are given by 



a(cj) 
b{u) 



Y cos^ uj{ti — r) ' 

Y sinuj{ti — r) 



(12) 



and the final best fitting sinusoidal model evaluates to 

p.{t, uj) — a{(jj) cosuj{t — t) + b{(jj) sm(jj{t — t). (13) 

From a preceding period analysis we should have al- 
ready estimated the possible rival frequencies Qi and 0)2, 
which correspond to the two largest peaks of the peri- 
odogram. Substituting these frequencies in (|13|l . we obtain 
the two best fitting models /2i(i) and p.2{t), which allow 
to evaluate all quantities h using (|9]), and then the Vuong 
statistic ((7]). In particular, it can be readily shown that 



NL = ^ h = z{Cji) — z{Q2), 



(14) 



which is not surprising, since the sum of k represents the 
pure log-likelihood ratio statistic, and the Lomb-Scargle pe- 
riodogram is tied to the log-likelihood function. 

This is not yet the full story. In practice, the quantities 
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(Ji usually are not known with good precision, and we need to 
model them too. For instance, the multiplicative model af = 
ft/wi is widely used, where the factor k is an extra unknown 
parameter, and Wi are the known statistical weights. In that 
case, the quantities U are more complicated, relatively to ((9]): 

_ N (V2{U) _ V^{U)\ 1 xl 



N 



Vk{t) = w{t){x(t) - 'jlk(t)) , Xk 



In the case of the sine curve (|10[) we obtain 

N 



NL 



Nk 



(23(tDl) - 23(22)) , 



(15) 



(16) 



where the periodogram 23 is defined in l|Baluevll2008l ). The 
number Nk. is equal to N — die, where die is the number 
of free parameters in the full model with the probe sinu- 
soid (|10|) . excluding the frequency parameter. For the cases 
that we consider here we always have die = 2 (two parame- 
ters a and 6), but for more general cases, which involve an 
underlying variation in addition to the periodic signal (see 
[Balucv 2008), the number die may be larger. 

Finally, we may use another parametrization of the 
measurement uncertainties, like e.g. the so-called additive 
model ffjip ) = P + CTmsaa .?: with a free parameter p, as con- 
sidered in (Baluevl lioogr ). In this case the formula for the 
quantities k will look like: 



2 



^2(^0 



0"i (p2 



Ml(tO 



CTi (pi 



+ log 



(Jj (P2 

o-i(pi 



(17) 



In this expression, pi,2 represent the estimations of the pa- 
rameter p associated to one of the models. 

In practice, we will probably use more complicated 
signal models than a sinusoidal one, of course. Just for 
example, we might want to add to our sinusoidal model 
some underlying variation, e.g. a simple constant term or 
a linear or quadratic trend, as it is done i n the general- 
ized least-square periodogra m described in l|Baluevl l2008l : 
IZechmeister fc Kiirsted 120091 ). We omit the detailed formu- 
lae for these cases, since they would be relatively bulky for 
a presentation here, and the reader can now easily derive 
them himself. 



5 PRACTICAL APPLICABILITY OF THE 
METHOD 

Considering a statistical test from the view point of its prac- 
tical applicability, we usually ask two main questions, 

(i) Concerning the behaviour of the test under the null 
hypothesis: How precisely we can estimate the false alarm 
probability, associated with an observed value of the test 
statistic? 

(ii) Concerning the behaviour of the test under the alter- 
native: Given an accurate FAP estimation, how sensitive is 
our test to practically expected deviations from the null hy- 
pothesis? 

Both issues can be addressed by means of numerical simu- 
lations, which is the goal of this section. 



5.1 Test reliability 

For the first issue, we need to specify some model condition 
satisfying the null hypothesis, then run a Monte Carlo simu- 
lation procedure (generating artifical random measurement 
errors), counting how frequently we meet the false alarms 
(the events when our test wrongly rejects the null hypoth- 
esis), and then to compare the simulated significance value 
with the expected one. In our case, we need first to construct 
some test signal that produces two alternative observed pe- 
riods, for which KLIC12 = 0. It is not so easy as it may 
seem, because simultaneously the rival periods should not 
be indistinguishable in principle. For example, for strictly 
evenly spaced observations, ti — iAt, each periodicity can 
be equally treated as having the original frequency u or 
any alias frequency cu + 27rfc/At for any integer k. All these 
alternative interpretations are fully equivalent and obser- 
vationally indistinguishable under any circumstances, since 
they generate exactly the same sequence of values at ti . The 
Vuong statistic is not defined for this case at all, because all 
li in ((9)1 are identically zero. This is not the situation that we 
want to test, since in practice we need to compare only po- 
tentially distinguishable alternatives. As a more realistic test 
model, we can consider a non-degenerate aliasing (the data 
are gapped in time rather than strictly evenly spaced), an 
ojo-periodic signal, and its two primary aliases aji,2 = too^i^g 
(with ujg being the data gapping frequency). If we neglect 
the main frequency u, these peer aliases provide practically 
equal interpretation of the observationslfl 

Given this test model, we can generate a Monte Carlo 
sequence of mock time series by adding simulated random 
errors to the probe sinusoid and generating random timings 
according to the specified gapping pattern. In our case. A'' 
timings ti were distributed uniformly within n = 10 period- 
ically gapped intervals [kPg, ik + f)Pg], with / € [0, 1] being 
a filling parameter (fixed during each simulation series) . For 
each such simulated data set, we evaluate the Vuong statis- 
tic V comparing the two primary aliases near ujoiojg. After 
that, we compare the resulting simulated distribution of V 
with the expected standard normal distribution. 

The results of the simulations are shown in Fig. [T] In 
these diagrams, the axes show expected and simulated sig- 
nificance levels in the so-called n-a notation, which is con- 
venient when dealing with normal or almost normal proba- 
bilities. For the expected significance (in the abscissas) this 
is just equal to an observed value of |V|. Each value in the 
ordinates, S, represents such critical value for a hypotheti- 
cal standard normal varaible x that the probability for |x| to 
exceed S is equal to the actual simulated probability for |V| 
to exceed the value in the abscissa. Larger S implies higher 
significance level and smaller FAP|3 If V indeed follows the 



^ The cases when the true signal period is never taken into ac- 
count are not really artifical. For example, the true orbital period 
of 0.7 days for the extrasolar planet 55 Cn c e was hidden for a 
very long time behind an alias of 2.8 days l l Dawson fc Fabrvckvl 
I2OICI) . This happened only because the researchers did not plot 
the periodograms beyond the 1 day period bound, until recent 
time. 

* We remind that the frequently used one-, two-, and three-sigma 
levels correspond to FAP = 31.7%, 4.6%, and 0.27%, and in gen- 
eral FAP = 2<1>(5). 
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expected "n-a" significance, |V| expected "n-a" significance, |V| expected "n-a" significance, |V| 

Figure 1. Predicted vs. simulated significancc levels for the Vuong test, depending on various parameters. Abscissas show the absolute 
value of the Vuong statistic, while the ordinates show the actual (simulated) significance level in the "n-cr" notation. If the Vuong statistic 
was perfectly normal then the simulated significance level would be SdV]) = |V|, and the corresponding graph would strictly follow the 
diagonal line. See text for further details. 



expected standard normal distribution then 5(|Vj) = |V|, 
and its graph should strictly follow the main diagonal. If, 
for instance, V has some non-unit variance ay and still nor- 
mal then S{\V\) = \V\/av. 

In Fig.[T]we show the graphs of SdV]) varying diflerent 
parameters of the test model. Overall, the agreement be- 
tween simulated and expected significances is good, except 
for certain rather boundary cases. The bad cases include: 
small number of the observations A'^ < 30, very small sig- 
nal amplitude A < a, or too weak aliasing / > 70%. The 
two latter bad cases correspond to a small signal/noise ra- 
tio of the aliases involved, so that we even cannot be sure 
that at least one of them is actually significant. For example, 
/ = 100% corresponds to no aliasing at all, when the two 
rival aliases in our test model are destroyed. Then the test 
attempts to compare noisy peaks which appeared in these 
positions occasionally. Such cases are not practical, since in 
practice the significance of at least one of the alternatives is 
already established before we ask which one is true. 

Summarizing these results, we can derive the following 
empiric condition of the Vuong test applicability: 

A' /a > ^No/N, (18) 

where A' is the best fitting amplitude of the rival periodici- 
ties, and A^o is a constant. The amplitude A' is smaller than 
the amplitude A of the original generating signal, since A' 
refers to the alias periodicities, while the main period is ne- 
glected. For our gapped time series we can readily obtain 
an analytic approximation for A' using the classical period 
analysis formulae (e.g. iVitvazev.i2001i ): 

A'^a'^. (19) 
1"/ 

In practice, it is more convenient to deal with the corre- 
sponding periodogram peak value rather than with the best 
fitting signal amplitude. We can rewrite psp in a very simle 
form 

^ - A ~ 20 = -p. (20) 




expected "n-a" significance, |V| 

Figure 2. Predicted vs. simulated significance levels for the 
Vuong test, depending on the true error distribution. The notes 
from Fig. [T] apply here. Always assuming the Gaussian model JTJ 
in the Vuong test, we perform simulations using several non- 
Gaussian error models: flat p.d.f. in the range [—1,1], Laplace 
p.d.f. <x e~l^l, and Cauchy p.d.f. oc 1/(1 -l-x^). Only the Cauchy 
distribution causes significant effect on the statistic V. 



When applying pOf) in practice, we should simply check 
whether the rival periodogram peaks that we want to com- 
pare are both above the Zq level. For out test problem, we 
find A^o — 50 and Zf^ ~ 10, although we must note that 
these thresholds may increase when e.g. we compare highly 
non-linear models. 

So far we assumed that the measurement errors distri- 
bution is always Gaussian. What if in true this distribution is 
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Figure 3. Predicted vs. simulated significance levels for the 
Vuong test, assuming fixed timings ti. The notes from Fig. [l] 
apply here. Here we vary only the signal amplitude. 



different from the one used in the Vuong test? Theoretically, 
the behaviour of the Vuong test should remain basically the 
same (in the most cases). This is verified by the simulations 
presented in Fig. [2] We can see that most error distribution 
that we tried out did not introduce any significant changes 
indeed. Only the Cauchy distribution produced some large 
systematic effect. This is not very surprising, because vari- 
ous pecularities of the asymptotic behaviour are typical for 
the Cauchy distribution, due to its heavy tails. 

ft is also interesting to investigate the case of fixed ti 
(Fig. [3]). In this case, the distribution of the Vuong statistic 
may become significantly more concentrated, in comparison 
with the random ti case. We find that the main critical pa- 
rameter in this case is the signal amplitude. For small A, the 
distribution of the Vuong statistic does not depend much on 
whether we assume fixed or random timings. For large yl, 
the scatter of the Vuong statistic shrinks, and the simulated 
significance level for the fixed ti becomes larger than for the 
random ti. This probably occures because large amplitude 
scales up the sensitivity of li in ((9]) to the small fiuctuations 
in ti. It is practically important that the standard normal 
distribution still allows us to assess a lower bound on the sig- 
nificance of |V|. This means that the number of false alarms 
still remains limited according to our request, although it 
may appear smaller. In addition, the cases of large ampli- 
tude are not very important, since in practice a period am- 
biguity is rare when the signal is so strong. We would like 
to emphasize that in practice the timings ti indeed usually 
have random nature that should not be neglected. 



5.2 Test efficiency 

Now let us consider the second main issue - the sensitivity 
of the Vuong statistic to the cases when one rival model is 
indeed better than the another one. In this case, we may 



consider the same sinusoidal test signal and simply com- 
pare the main period with one of the aliases, e.g. the pri- 
mary one having frequency ujo + ^g- The average value of 

V should be now biased to positive values, since the first 
(correct) model should look systematically more likely. For 
each simulation trial, we may have three types of outcomes: 
a success (|V| exceeds some positive critical value V* and 

V is positive too), a failure (|V| > V* but V < 0), and a 
neutral (inconclusive) outcome (|V| < V*). We denote the 
respective probabilities as Ps, Pf, and Pi. All of them are 
functions of the critical level V*, which can be tied to the 
rejection significance level S{V,) « V* or to the correspond- 
ing false alarm probability. Speaking in terms of Section [2j 
the quantities Pf, Pi, and FAP represent the probabilities 
to make the first, second, or third kind mistake. In addition, 
we can now define a few other indicators of the test perfor- 
mance: the failures/successes probabilities ratio, FS, the fail- 
ures/inconclusives ratio, FI, and the successes/inconclusives 
ratio, SI. Obviously, 

1/Ps = 1 + FS + SI, 

1/Pf = 1/FS + 1/FI + 1, 

1/P, = 1 -I- 1/SI + FI, 

FI = FS/SI. 



(21) 



Only two variables involved in these equalities can be treated 
as independent. We choose to use FS and SI as these basic 
characteristics. These two quantities characterise, how fre- 
quent are mistakes of the first and second kind, in compar- 
ison with successful classifications (given fixed probability 
of the third kind mistakes, FAP). The quantity FS can be 
also interpreted as a measure of how much our test exceeds 
a simple drop of a coin (which has FS = 1). The quantity 
SI measures the test conservativeness. A perfect test should 
possess small values of FS and SI. 

Together with the requested significance level, we now 
have three independent characteristics FS, SI, and S, which 
are all functions of a single control parameter - the critical 
value V*. Therefore, to fully characterise the test perfor- 
mance, we should investigate the corresponding paramet- 
ric curve in the three-dimensional space (FS,SI, 5"). This 
is a bit more complicated than, for example, in the usual 
signal detection task, when we have only two independent 
variables — the false alarm probability and the probability 
of wrong non-detection. We therefore have no other option 
than to deal with some two-dimensoinal projections. Since 
the bahaviour of 5'(Vt) « V* has been already investigated 
in details, we now look at the pair (FS, SI). 

In the Fig.Hwe plot several graphs of (SI(V*), FS(V.)) 
as parametric curves, setting different values for the signal 
and time series parameters. On each curve we also mark a 
few points corresponding to the values of V* from 1 to 3. As 
we could expect, increasing the signal amplitude A, or the 
number of observations A'^, or the time series filling factor 
/ decreases both the probability of a misclassification and 
of an inconclusive result. We may note that when varying 
different control parameters we usually obtain very similar 
curves. This indicates that there should be a single quan- 
tity (a combination of A'', A, and /) that defines the test 
performance for our task. On the basis of the presented sim- 
ulations, we can empirically construct this critical quantity 
as 
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inconclusives / successes inconclusives / successes inconclusives / successes 



Figure 4. Simulated performance of the Vuong test, depending on various parameters. Here we inspect the behaviour of the Vuong 
statistic comparing the primary ("true") peak near the main frequency ujq and its aUas near ojq + '^g (the exact values were adjusted to 
achieve a local maximum of the periodogram). In each curve we also put five points marking the positions of V* = 1, 1.5, 2, 2.5, 3, which 
approximately correspond to the same significance levels (S), i.e. from one-sigma to three-sigma. The "failures" refer to the cases when 
the Vuong test suggested to select the alias peak. The number of Monte Carlo simulations was 10®. See text for further details. 



a 



(22) 



Three curves in each panel of the Fig.|4]correspond to G ~ 1, 
G « 2, and G « 3. The formula (|22[) can be also justified 
theoretically. Indeed, since the Vuong statistic is based on 
the quantity L, which is expressible as the difference (|14p 
between two rival periodogram values, the latter difference 
should represent a critical parameter characterising the test 
performance. For the primary peak and its alias, this differ- 
ence looks like 
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(23) 



This is not a rigorous proof of (I22p . however. 

We can see that the Voung test can manage the mis- 
classification errors very well, but it favours a relatively large 
number of inconclusive results. This is not very surprising, 
since this test is originally designed to be conservative in 
drawing definite conclusions. This is the main reason why it 
suppresses the misclassifications so efficiently. From the pre- 
sented simulation results, we may also note that in practice 
it does not make much sense to request very high significance 
level from the Vuong test. Requesting large value of S{V*) 
can further suppress the third-type mistakes, but by the cost 
of a dramatic increase in the fraction of inconclusive results, 
which can even become dominating. Simultaneously, the rel- 
ative fraction of the first-type mistakes (misclassifications) 
either is not decreased very much or is already sufficiently 
small for moderate S. Therefore, we believe that in prac- 
tice it is enough to set the required significance level about 
1.5 — 2 sigma. 

In practice we often deal with more complicated situa- 
tions, however. In addition to the true period, we may have 
also multiple aliases, and there is no guarantee that the true 
period is indeed associated to one of the two largest pe- 



riodogram peaks. Both selected highest peaks may occure 
aliases. In that case, we need to answer a principal question: 
what event we should treat as a failure of the test? Appar- 
ently, a failure should occur each time when the test suggests 
to choose an alias period. However, testing for multiple al- 
ternatives is not the responsibility of the Vuong test. The 
only thing that we required from this test, by its design, is 
to choose a better solution among the two ones provided on 
input. From this view point, the failures only occur when the 
test suggests to choose the peak which is more distant from 
the true period than the alternative candidate. The cases 
when the test selects an alias peak, while the second can- 
didate is a larger-order alias, should be treated as successes 
(even if the selected peak is an alias too). 



We performed simulations for both interpretations. To 
increase the contrast, we also added some penalty in the 
first interpretation, counting each selected fc-order alias near 
ij = LUoizkidg as k failures at once. The results appeared ap- 
parently the same, however. This indicates that the cases, 
when both the largest periodogram peaks are only aliases, 
are rare and usually trigger an inconclusive decision of the 
Vuong test. The results of the simulations are shown in 
Fig- El In fact, this figure differs from Fig. [4] mainly because 
it is now allowed to choose any of the two first-order aliases 
near ojq ± "^s, instead a single alias alternative luq + ujg. A 
high-order alias may be selected too, but, as we have just 
discussed, the probability of such an event appears negligi- 
bly small. Comparing Fig. [S]with Fig. O we note two main 
differences. First, the fraction of inconclusive results is in- 
creased, as well as its sensitivity of the requested value of 
V*. Second, the threshold value of G, necessary to suppress 
these inconclusive decisions, grew roughly by half, and the 
sensitivity of the simulated curves to the value of G is also 
increased. To have, say, only half of inconclusive decisions, 
given 1.5-sigma or 2-sigma confidence level (5), the value of 
G should be about 4.5 or 3.5. 
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Figure 5. Simulated performance of the Vuong test, depending on various parameters. Similar to Fig. [4] but instead of picking two given 
periodogram peaks (the main one and a given alias) we now inspect two largest peaks among many period candidates near ujq + kujg. 
The parts of the curves corresponding to V» values between 2.5 and 3 are very unreliable here, regardless a larger number of Monte Carlo 
simulations (10^). These curves appeared apparently the same for two different definitions of a "failure": (i) the cases when the Vuong 
test recommends to select a peak, while the second candidate is actually better, and (ii) the cases when the Vuong test recommends to 
select any alias peak (not the true one). In the second interpretation, selecting a fc"'-order alias was counted as k failures at once. The 
plotted curves are for the first interpretation. See text for further details. 



6 APPLICATIONS 

6.1 Exoplanetary system orbiting 55 Cnc 

For many years it was believed that the planet 55 Cnc e, 
the innermost planet in the 55 Cnc system, posesses an 
orbita l period of « 2.8 day, until lOawson fc Fabrvckvl 
l|2010t ) showed that this apparent periodicity is actually a 
diurnal alias of the true one with Pe ~ 0.7 day. This new 
period value allows for a significantly better fit of the avail- 
able data. This true period value could hide for so long time 
only because the researchers did not consider any potential 
periods smaller that 1 day. Here we would like to rigorously 
compare these alternative period values utilising our new 
method. We use Li ck and Keck radial velocity data from 
l|Fischer et al.l |2008| ) to obtain two alternative five-planet 
fits with Pe ~ 0.7 day or Pe ~ 2.8 day. With the use of 
multi-Keplerian (unperturbed) model of the radial velocity 
we obtain V « 5.6 for these alternatives. Using A''-body New- 
tonian radial velocity model, as described in (Baluev 201ll). 
we have V ~ 5.5 (assuming the system is seen edge-on). No 
doubts, the period of 0.7 day is indeed the correct one. 

We may also carry out another, somewhat unusual, 
model comparison. Estimated planetary orbital parameters 
and masses, as well as the fit quality, are a bit different for 
Keplerian and Newtonian models of radial velocity. So we 
might ask: are these differences statistically significant? In 
other words, can we say that mutual planetary perturba- 
tions are already detected in the RV data? To answer this 
question, we just need to find two orbital fits, one using Ke- 
plerian and another using Newtonian model, and then to 
calculate the Vuong statistic for these models. We obtain V 
of only 0.2. Allowing for the common orbital inclination to 
the sky plane to float during the fitting, we get V ~ 0.4 (with 
the best fitting inclination of 16°). So small values of V say 
that possible gravitational perturbations in this system are 
not yet detectable from RV curve. Therefore, any attempt 



to constrain orbital inclinations in this system form the RV 
data (on the basis of potential interplanetary perturbations) 
would be probably meaningless, possibly except for putting 
some very mild and thus not too much useful limits. 

6.2 Exoplanetary system orbiting HD75898 

This star was observed by the Keck team ([Robinson et al.l 
l2007l ). who reported the discovery of a Jovian-mass planet 
having orbital period of about 400 days. The relevant pe- 
riodogram shows actually two comparable peaks at the pe- 
riods of 400 days a nd 200 days (see Fig . [6l). L et us try to 
verify the results bv lOawson fc Fabrvckvl (|2010l ). who inves- 
tigated this aliasing ambiguity more carefully and confirms 
that the 200-day period should be rejected indeed. We basi- 
cally agree: V ~ 3.5 in this case, so we can safely choose the 
period of 400 days, since the difference between the models 
has very high significance of more than 3ct. 

6.3 Exoplanetary system orbiting GJ876 

The planetary syste m orbiting GJ876 is currently believed 
to host four planets (|Rivera et al.|[2O10l ). The detailed anal- 
ysis of all up-to-date radia l velocity data for t his system, 
including recent H ARPS (ICorreia et al.l 2010l) an d Keck 
(|Rivera et al.ll2010l ) data, is given in (|Baluevll20l"ll ). along 
with the full orbital configuration details. Here we would 
like to consider two ambiguities associated to this planetary 
system. 

The first issue is related to the orbital period of the 
innermost planet GJ8 76 d. Since the v ery discovery of this 
planet it was noted (|Vogt et al.l l2005i ) that (presumably) 
the primary period value Pd ~ 1.938 day is accompanied 
by a diurnal alias of Pd ~ 2.055 day. The analysis done by 
iDawson fc Fabrvckvl (|2010l ) supports this classification, as 
well as our calculations, which yield V ~ 2.0 for this alias 
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Figure 6. Periodogram of HD156668 radial velocity data, showing two high peaks at the period values of 400 days and 200 days. The 
Vuong test suggests that these peaks are distinguishab le at 3.5cr level, thus we ma y definitely accept the peak at 400 days as the correct 
solution for the planet b. Here we perfectly agree with lDawson fc Fabrvckvl i201(t) . 



ambiguity. Therefore, here we are able to rigorously confirm 
that these periods indeed are well-distinguishable and the 
correct period is 1.938 day. 

The second issue that we would like to highlight con- 
cerns the determin ability of the p lanet GJ876 e eccentricity, 
ee. It is noted in (|Baluevl l201ll ) that this eccentricity, al- 
though is bounded by ~ 0.2 from the upper side, looks still 
ill-determined below this limit. In particular, we have two 
comparable local minima of the likelihood function: the first 
one at ee « and another at ee ~ 0.15 (with the corre- 
sponding pericenter argument lu^ ~ 45°). Therefore, this 
ambiguity looks like a good task suitable for the statistical 
methods proposed in the present paper. We find that the 
Vuong test agrees that these two orbital solutions are ob- 
servationally indistinguishable: V varies within the range of 
0.5 — 1, depending on some minor model details. This result 
allows us to confirm more rigorously the conclusion that ee 
is indeed ill-determined^ 

As it is shown in (l Baluevll201ll '). both HARPS and Keck 
data for GJ876 contain significant autocorrelated compo- 
nent (red noise). We must acknowledge that this red noise 
could make the Vuong test not very reliable, because such 
measuremets are not uncorrelated. Nevertheless, in practice 
taking this red noise into account usually increases various 
statistical uncertainties, so apparently distingushable mod- 
els could appear actually equivalent, but hardly vice versa. 

We do not believe, however, that this red noise could 
affect our previous disambiguation of the planet d period. 
The two alternatives for this period reside very close to each 
other, while the effect of the correlated noise is spread over 
the whole frequency spectrum. It could distort the balance 
between some distant periodogram peaks, but not between 
these ones. 



6.4 Exoplanetary system orbiting GJ3634 

According to iBonfils et al.l (|201ll ). this star is orbited by a 
super-Earth planet each 2.6 day. As the discoverers note, in 



Table 1. Comparing three alias alternatives for GJ3634 b: values 
of V obtained for various alias pairs 



Alias 



1.60 d 2.67 d 



2.65 d 
1.60 d 



2.4 
0.8 



addition to the periodic signature of this planet, the radial 
velocity data also contain a parabolic long-term drift. The 
periodogram of the RV data with this trend removed (that 
is, included in the base model) is shown in Fig. [7| This pe- 
riodogram shows two apparent peaks, with the peak at ap- 
proximately 2.6 days being actually double, consisting of a 
close pair at the periods of 2.65 day (height 32) and 2.67 day 
(height 18). The discovery team does mention the period of 
1.6 day, but they just retract it as an alias without any rig- 
orous justification. They also do not mention that the peak 
at 2.6 day is double. The results of our analysis are given 
in Table [1] where each cell contains a numerical value of V, 
comparing the aliases marked in the left column and in the 
top line. We can see that the peak at 2.67 day can be safely 
retracted in favour of its neighbour at 2.65 day (significance 
2.4(t). The period value of 1.6 day is also rather unlikely, 
albeit now V ~ 1.8, which is only moderately significant. 



6.5 Exoplanetary system orbiting HD156668 

On the basis of the Keck observations, [Howard et all (|201lD 
reported that this star is orbited by a super-Earth hav- 
ing short period of 4.6 d ays or, possibly, 1.2 days. As 
iDawson fc Fabrvckvl (|2010l ) claim, the correct value of the 
period should be 1.2 days, and the peak at 4.6 day is only its 
diurnal alias. Let us try to verify these results too. First, we 
can see that there are actually more that two possible peri- 
ods at stack (Fig.[8l top). For further investigation we select 
four periodogram peaks for periods longer than half-day. We 
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Figure 7. Residual periodogram of GJ3634 radial velocities, with long-term quadratic drift removed. We can see two peaks here. The 
peak at 2.6 days is actually double. The Vuong test suggests to select the higher peak at 2.6 day in comparison with the peak at 1.6 day, 
but at a moderate significance of l.Scr. The smaller peak near 2.6 day can be rejected at 2Aa. 



Table 2. Comparing four alias alternatives for HD156668 6: val- 
ues of V obtained for various alias pairs 

AUas 4.6 d 0.56 d 800 d 

1.2 d 0.9 0.8 0.5 
4.6 d — 0.3 0.2 
0.56 d — 0.0 



fit four corresponding models assuming circular orbil[f| and 
calculate tlie Vuong statistic for each pair of these models. 
The results are given in Table [2] Investigating this table, 
we can see that the Vuong statistic is always small (< 1). 
Thus our results do not s upport the conclusion made by 
iDawson fc Fabrvckvl (|201(]| ) concerning this planet. 

We prefer to remain cautious also because it seems that 
some extra variations may be present in the radial veloc- 
ity of this star. The residual periodogram with, say, 1.2- 
day oscillation subtracted off, demonstrates that the peak 
at 800 days does not disappear (Fig. [HI bottom). Its height 
(and thus significance) appears the same as for the first pe- 
rio d that we extracted . Indeed, using the methods described 
in (|Baluevll2008l , [20091 ) we find that this peak infers the false 
alarm probability of only 0.1%. It does not matter if we try 
to extract first the periodicity of 4.6 day or 0.56 day instead 
of 1.2 day, the 800-day peak remains here. On contrary, ex- 
tracting this 800-day peak at the first place does not elim- 
inate the three remaining peaks at short periods. Speaking 
shortly, three short periods are mutual aliases of each other, 
while 800-day period is a standalone entity. 

This long-period variation might represent a hint of an 
extra planet, or maybe of some other pheno menon, like th e 
red noise effect in the RV data for GJ876 l|Baluev|[2oTll) . 



^ The orbital eccentricity looks ill-determined in this case, be- 
cause it often rises to unrealistic values. Also, due to the famous 
tidal orbit circularization effect, it is unlikely that an eccentric 
orbit could reside so close to the star. 



Indeed, similarly to the GJ876 case, we can see in this peri- 
odogram a broad band of low-frequency peaks without any 
standalone clearly dominating peak (either 800 days or any 
other). Just like any usual harmonic periodicity, this low- 
frequency band produces a diurnal alias band around the 
period of 1 day. This diurnal band is actually clearly seen 
in the periodogram, and it might easily enforce the 1.2- 
day peak, relatively to the 4.6-day one. This argumentation, 
along with a small value of the Vuong statistic, force s us to 
disobey the recommendation bv lDawsori fc Fabrvckvl (|2010D 
to leave only 1.2 day period as the correct solution. We do 
not find a rigorous statistical basis for this in the current 
data. 



7 CONCLUSIONS AND DISCUSSION 

We have introduced the Vuong test, which is an asymptot- 
ically normal statistical test for verifying the observational 
equivalence of alternative models. This method compare the 
alternative models in the sense of certain statistical measure 
of their divergence. The Vuong statistic uses the KuUback- 
Leibler Information Criterion to establish whether the mod- 
els are equivalent or not. Importantly, this test does not 
require that one of the rival models should be correct. In- 
stead, both alternatives may occur formally wrong (misspec- 
ified case). Such models are still correctly ordered in terms 
of the adopted divergence measure. We classify this as a 
high practical advantage, since in practice no model is pre- 
cisely correct: any data always contain some hidden residual 
variations and the true error distribution may deviate from 
the normal function. We should however admit that any or- 
dering of the alternatives is not very useful if all available 
alternatives are too far from the truth. Another important 
advantage of our test is due to its practical simplicity and 
calculation speed. It is fully analytic and do not require any 
CPU-consuming simulations in the routine use. 

Throughout the paper, we mainly focused our attention 
on the period ambiguity issue, since this is one of the most 
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Figure 8. Top; periodogram of HD156668 radial velocity data calculated according to llBaIuevll2009l ) and involving only a constant term 
in the base model. Bottom: similar residual periodogram with planet b sinusoidal signature embedded in the base model as well (and 
assuming the period Pj, 1.2 days). In the second periodogram, the peak at 800 day s remains intact (and even enforced). It has the 
same significance as the 1.2-day peak in the first periodogram. Here we disagree with IPawson fc Fabrvck^ l|2010l ). who suggested the 
1.2 days peak as a definite solution. 



practical task. However, the Vuong test can be also applied 
to check the model equivalence in other situations, when the 
ambiguity is related to any other parameters (not necessarily 
the period). The main necessary formulae of the test remain 
basically the same. 

We have reanalysed the radial velocity data for several 
extrasolar planetary systems. We verified that orbital peri- 
ods of the planets 55 Cnc e, HD75898 b and GJ876 d can be 
resolved unambigiously on the basis of presently available 
data, while the period of HD156668 b and the eccentricity 
of the planet GJ876 e are still ambigious. 

A popular rival approach for solving such dis ambigua- 
tion tasks is the Bayesian model selection (e.g. iGregorvl 
[2003)- However, the great disadvantage of the Bayesian 
methods is their computational complexity requesting for in- 
tensive numerical simulations. The Vuong test is completely 
free from this disadvantage. Also, when dealing with a task 
like distinguishing between different alternatives, we should 
always have an adequate impression of which statement is 
derived from the real observations, and which one follows 
from an a priori assumption not tied to the actually ob- 



served data. This impression is difficult to obtain if we use 
Bayesian statistics, especially when we compare two solu- 
tions from very different regions of the parameter space. 
Basically, Bayesian methods mix the subjective prior as- 
sumptions with the objective information derived from the 
real data, which makes them, in our view, often unsuitable 
for model intercomparison. We do not decline the known 
strengths of the Bayesian methods, but we must m ention 
that debates on the Bayesianism still do not cease l|Efronl 
1 19861 : iGelmanllioog ). In practice, it is always necessary to 
verify that the results of the Bayesian analysis are stable 
with respect to choosing different reasonable prior distribu- 
tions. If this stability is not ensured, then there is a risk that 
such analysis is in fact more based on the prior assumptions 
rather than on the real data. 

The Vuong test is not Bayesian, but it does not implic- 
itly assume any specific prior distributions. Although sub- 
jective assumptions are never purged out completely, we can 
pursue to minimize their influence. The Vuong test attains 
good resistence to this undesired influence. We count it as 
its great advantage. 
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APPENDIX A: MEANING OF THE RANDOM 
TIMINGS 

It is an essential pecularity of the Vuong test that it treats 
the timings and measurement uncertainties as random quan- 
tities. This randomness is not related to possible inaccura- 
cies in the determined values of U and ai. Instead, it is 
related to the observations scheduling, which is treated as 
a random process. This issue should be clarified in more 
details now. 

In the astronomical practice, it is rarely taken into ac- 
count that the time of an observation is often a random 
quantity, broadly analogous to the random measurement er- 
ror. It is usually assumed that ti (as well as ai) are fixed a 
priori for each data set analysed. It is assumed that all statis- 
tical uncertainties in any data-derived quantities are caused 
only by random errors in Xi. Basically, we implicitly embed 
the observed data set into a general ensemble of hypothet- 
ical similar time series, each having its own sequence of Xi 
and still the same sequence of ti. This restrictive assumption 
about ti may have undesired effect sometimes. To demon- 
strate this, assume that we are carrying out a large survey 
of many astronomical targets. In practice, the timings of 
individual observations are always distributed according to 
some observational time window, which often shows some 
regular patterns (e.g. periodic gaps), but within this time 
window they are usually distributed pretty randomly. Usu- 
ally we cannot take a snapshot of all targets at once, so each 
target has individual time series with different sequence of ti. 
Under such circumstances, the random scatter of any data- 
derived quantity (e.g., some parametric estimation or a test 
statistic) incorporates an extra contribution inferred by the 
random fluctuations of ti. When we subsequently apply our 
statistical procedure to each target of such survey, we should 
encounter, in general, more blurred uncertainties and more 
frequent false alarms than we can expect assuming that all 
ti are flxed. Therefore, to deal with this situation properly, 
we should consider another general enseble of time series, 
which admits of random variations in ti and ai, as well as 
in Xi. 

We do not try to scare the reader by claiming that 
all results of any data analysis done so far should be now 
rechecked taking into account the possible fluctuations of ti. 
The quantities used traditionally in the astronomical data 
analysis, are often statistically invariable with respect to the 
exact sequence of ti. "Statistically invariable" means here 
that the corresponding distribution function is invariable 
with respect to fluctuations of ti within any given distribu- 
tion pattern, although individual values of the test statistic 
usually do depend on ti. For example: 

(i) It is well-known that the values of the Lomb-Scargle 
periodogram are exponentially distributed, regardless the 
actual sequence of timings. Consequently, they remain ex- 
ponentially distributed for the random ti too. 

(ii) According to I Baluev! l|2008f ) , the tail distribution of 
the maximum peaks of the Lomb-Scargle periodogram can 
be approximated by a formula We'^y'^, where W is pro- 
portional to the sample variance of ti. For large A^, the lat- 
ter variance is practically invariable with respect to ran- 
dom fluctuations of ti, if these fluctuations obey some well- 
specified time window. Therefore, the distribution of the 
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maximum periodogram peaks also should not be signifi- 
cantly affected by the fluctuating timings. 

(iii) The maximum-likelihood (or least-square) estima- 
tions of various model parameters are usually asymptotically 
unbiased for large A*', with the uncertainty approximately 
proportional to 1/y/N. The distribution of such estimations 
is asymptotically invariable with respect to the random vari- 
ations of ti, if these variations follow any specified time win- 
dow. The uncertainties of the estimations may often depend 
on this time window, however. 

When ti (and ai) are treated as fixed non-random quan- 
tities, the distribution of the Vuong statistic significantly 
depends on their exact sequence. This can be easily demon- 
strated. We can avoid dealing with any distribution of ti if 
we redefine the KLIC divergence as 



KLIC'i2(01,f2,z) = 



= — E" 



log 



Ci{x\z, 0i) 

C2(x\z,e2)' 



(Al) 



where the expression under the E" sign represents now the 
usual (observed) log-likelihood ratio for the two models, and 
the expectation itself is now taken conditionally to fixed Zi. 
This new divergence measure represents a direct analog of 
the quantity Q, but calculated for a particular discrete se- 
quence of Zi. There is no obstacle to use the same Vuong 
statistic ^ to test the new null hypothesis KLIC'12 = 0, 
which is apparently no worse than testing the hypothesis 
KLIC12 = 0. Indeed, the value of KLIC'12, which represents 
an average over the A'' timings, should converge to KLIC12 
for N ^ 00. Since we anyway consider only this asymptotic 
case, the two measures KLIC and KLIC' are just equivalent, 
and the quantity L in ((5| can be used to estimate both. How- 
ever, the variance of L is misbehaving: it has systematically 
different values for fixed Zi and for random Zi. This occures 
because random fiuctuations in Xi and in Zi generate com- 
parable contributions in the total variance of L. Thus the 
mentioned variance should be significantly smaller for fixed 
Zi than for random Zi. 

It is not hard to show this rigorously. Indeed, when Zi 
are treated random, all k have the same distribution, and it 
is easy to derive that 



5" 1 



^IJ, ^Z,.L = tlj, NIil^,L = nlj, (A2) 

where D stands for the variance operator. Da; = Ea;'^ — (Ex)^. 
We can see that in this case the normalised statistic V = 
Ly/N jv indeed should approximately follow the standard 
normal distribution. When Zi are fixed, the averaging over 
Zi in (|A1[) and any derived formulae can approximate the 
expectation E^. Applying this rule, we may derive 



e: 



J, 



■"|.^.(A3) 

We can see that did not attain any significant systematic 
bias, as well as L, while the variance of L is now different: 



N 

= lEz(E°|2/)' 



(E»E»|J) 



"1 I = 

X I z " 

J > 0. 



(A4) 



Therefore, if we consider our situation conditionally to 
fixed Zi, we find that the variance of the Vuong statistic is 
systematically smaller than unit. This variance deficit does 



not tend to zero when A'^ 
a constant value 



J°E° 



00. Instead, it stabilizes at 
It is difficult to apply the 



Vuong test in such situation, since then the exact variance 
of V is unknown even for N ^ 00, although we know that 
it cannot exceed unit. The variance deficit of V disappears 
only in a very specific case when E°|^Z is constant in z. 

However, we must emphasize again that the Vuong test 
does not request to specify the distribution of ti and ai ex- 
plicitly. In practice, we should not care about this distribu- 
tion at all, and the practical application of the test is easy. 
In other words, although the Vuong test is sensitive to the 
presence of random fiuctuations of individual timings, it is 
invariable with respect to their distribution relflecting the 
shape of the time window. 

We still can imagine practical cases falling out of the 
random interpretation of the timings. For example, when 
we take a sequence of images of the same field, it makes the 
timings non-random - they appear always the same for all 
targets of such fixed-field survey. In this case, however, the 
Vuong test does not stop working. As we have just shown, 
the variance of the Vuong statistic can only decrease when 
we move from random ti to fixed ti, so when we apply it 
to such fixed-field survey, we still obtain at least an upper 
limit on the false alarm probability. This means that if the 
Vuong test recommends to retract the null (equivalence) hy- 
pothesis at e.g. 99% level, we can safely retract it even if the 
sampling patterns are the same for all our targets: the true 
confidence level maybe possibly larger than 99%, but this 
only implies even larger significance. The only negative con- 
sequence is that in the case of so specific data sampling we 
can distinguish more close alternatives than usually, and the 
Vuong test does not catch this opportunity. That is, another 
test having better sensitivity in this specific situation may 
exist, but it is hardly as general as the Vuong one. 

This paper has been typeset from a Tj^X/ I^Tj^X file prepared 
by the author. 



